df_final
| | Country | Year | CO2 emissions | Population | Continent |
|---|---|---|---|---|---|
| 0 | World | 1971 | 67.974998 | 71.190002 | NaN |
| 1 | World | 1972 | 71.264000 | 72.639999 | NaN |
| 2 | World | 1973 | 75.351997 | 74.102997 | NaN |
| 3 | World | 1974 | 75.197998 | 75.533997 | NaN |
| 4 | World | 1975 | 75.483002 | 76.933998 | NaN |
| ... | ... | ... | ... | ... | ... |
| 8587 | Oceania | 2014 | 143.311996 | 137.014008 | NaN |
| 8588 | Oceania | 2015 | 145.908997 | 139.089005 | NaN |
| 8589 | Oceania | 2016 | 148.843002 | 141.373993 | NaN |
| 8590 | Oceania | 2017 | 150.341995 | 143.796005 | NaN |
| 8591 | Oceania | 2018 | 149.811005 | 146.080994 | NaN |
8592 rows × 5 columns
df_merge
| | Country | Year | CO2 emissions | Population | Continent | Country Code | Population growth (annual %) |
|---|---|---|---|---|---|---|---|
| 0 | World | 1971 | 67.974998 | 71.190002 | NaN | WLD | 2.133117 |
| 1 | World | 1972 | 71.264000 | 72.639999 | NaN | WLD | 2.031211 |
| 2 | World | 1973 | 75.351997 | 74.102997 | NaN | WLD | 1.982943 |
| 3 | World | 1974 | 75.197998 | 75.533997 | NaN | WLD | 1.929549 |
| 4 | World | 1975 | 75.483002 | 76.933998 | NaN | WLD | 1.855834 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6079 | United Arab Emirates | 2014 | 340.184998 | 504.048004 | Asia | ARE | 0.176775 |
| 6080 | United Arab Emirates | 2015 | 359.829010 | 506.729004 | Asia | ARE | 0.527292 |
| 6081 | United Arab Emirates | 2016 | 370.427002 | 512.090027 | Asia | ARE | 1.053271 |
| 6082 | United Arab Emirates | 2017 | 386.880005 | 518.981995 | Asia | ARE | 1.339470 |
| 6083 | United Arab Emirates | 2018 | 371.341003 | 526.859985 | Asia | ARE | 1.503938 |
6084 rows × 7 columns
fig1.show()
fig2.show()
fig3.show()
fig4.show()
fig5.show()
with $\bar{x}$ and $\bar{y}$ the sample means.
Moreover, given the sample correlation coefficient $r$, the following hypothesis test is performed: $$ H_0: \rho = 0, \;\; H_1: \rho \neq 0,$$ with $\rho$ the true (population) correlation coefficient.
stats.pearsonr(df_merge_2018_1['CO2 emissions'], df_merge_2018_1['Population growth (annual %)'])
(0.48855680323340167, 1.7316896372956907e-08)
$r = 0.49 \Rightarrow$ moderate linear relationship $(0.3 \leq |r| \leq 0.5)$;
$p\text{-value} = 1.73 \times 10^{-8} \Rightarrow$ very strong evidence against the null hypothesis $H_0: \rho = 0$, which is therefore rejected.
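The test above can be illustrated on synthetic data. The following is a minimal sketch, not the notebook's own computation: the arrays `x` and `y`, the sample size `n`, and the slope `0.5` are assumptions chosen for illustration. It computes $r$ from its definition, compares it with `stats.pearsonr`, and recovers the two-sided $p$-value from the statistic $t = r\sqrt{(n-2)/(1-r^2)}$, which follows a Student $t$ distribution with $n-2$ degrees of freedom under $H_0$ when the data are normally distributed.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
n = 120
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Sample correlation coefficient computed from its definition.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Same quantity, plus the two-sided p-value for H0: rho = 0.
r, p_value = stats.pearsonr(x, y)

# Equivalent t-test on n - 2 degrees of freedom.
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.3f}, p-value = {p_value:.3g}")
```
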
Let $K$ be the desired number of clusters, and $C_1,\ldots, C_K$ the sets of indices of the observations in each cluster, such that $\bigcup_{i=1}^K C_i = \{1, \ldots, n\}$ and $C_k\; \cap \;C_{k^{'}} = \emptyset,\; \forall k \neq k^{'}$, with $n$ denoting the total number of observations. In other words, each observation belongs to exactly one cluster.
Let $W(C_k)$ be the within-cluster variation defined using the squared Euclidean distance. That is, $$ W(C_k) = \frac{1}{|C_k|}\sum_{i, i^{'} \in \; C_k}\;\sum_{j = 1}^{p}(x_{ij} - x_{i^{'}j})^2,$$ with $|C_k|$ indicating the number of observations in the $k$th cluster, and $p$ the number of features.
The objective of $K$-means clustering is to find a partition $C_1, \ldots, C_K$ for which the total within-cluster variation, summed over all $K$ clusters, is as small as possible. That is, \begin{equation} \label{eq:kMeansObjective} \min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|}\sum_{i, i^{'} \in \; C_k}\;\sum_{j = 1}^{p}(x_{ij} - x_{i^{'}j})^2 . \end{equation}
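The pairwise form of $W(C_k)$ is equal to twice the sum of squared distances of the observations to the cluster centroid, which is why the algorithm can work with centroids instead of all pairs. The sketch below checks this identity numerically for a single cluster; the helper names `within_cluster_variation` and `centroid_form` and the random data are hypothetical, introduced only for illustration.

```python
import numpy as np

def within_cluster_variation(X):
    """W(C_k): sum of squared Euclidean distances over all ordered
    pairs of rows of X, divided by the number of rows."""
    diffs = X[:, None, :] - X[None, :, :]   # shape (m, m, p)
    return (diffs ** 2).sum() / len(X)

def centroid_form(X):
    """Twice the sum of squared distances of each row to the centroid."""
    return 2.0 * ((X - X.mean(axis=0)) ** 2).sum()

# One synthetic cluster (assumption): 30 observations, p = 2 features.
rng = np.random.default_rng(1)
cluster = rng.normal(size=(30, 2))

w_pairwise = within_cluster_variation(cluster)
w_centroid = centroid_form(cluster)
print(w_pairwise, w_centroid)  # the two values agree up to rounding
```
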
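The algorithm converges to a local optimum rather than a global one, and the result depends on the random initialization $\Rightarrow$ The algorithm is run several times (here, 10) from different initializations, and the solution with the smallest objective value is selected.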
The two features 'CO2 emissions' and 'Population growth (annual %)' are measured on different scales $\Rightarrow$ They are scaled to have standard deviation 1 so that they have the same impact on the computation of the distances.
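The scaling and the repeated runs described above can be sketched with scikit-learn. This is an illustration under stated assumptions rather than the notebook's own code: the data are synthetic, the number of clusters `K = 3` is arbitrary, and `StandardScaler` also centers each feature, which does not change the Euclidean distances between observations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales,
# generated around three hypothetical group centers.
rng = np.random.default_rng(0)
centers = np.array([[50.0, 0.5], [200.0, 1.5], [400.0, 3.0]])
X = np.vstack([
    c + rng.normal(scale=[15.0, 0.2], size=(40, 2)) for c in centers
])

# Scale each feature to standard deviation 1 (and mean 0).
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 runs the algorithm from 10 different initializations and
# keeps the run with the smallest sum of squared distances to centroids.
K = 3  # hypothetical number of clusters
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(kmeans.inertia_)       # objective value of the best run
print(np.bincount(labels))   # number of observations per cluster
```

The attribute `inertia_` is the sum of squared distances of the observations to their closest centroid, so it is proportional to the objective written in terms of pairwise distances.
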
fig7.show()
fig8.show()
fig9.show()